Fast Kernel Methods for SVM Sequence Classifiers
Abstract
In this work we study string kernel methods for sequence analysis and focus on the problem of species-level identification based on short DNA fragments known as barcodes. We introduce efficient sorting-based algorithms for exact string k-mer kernels and then describe a divide-and-conquer technique for kernels with mismatches. Our algorithm for the mismatch kernel matrix computation improves currently known time bounds for these computations. We then address the fast kernel vector evaluation problem during SVM testing. Finally, we introduce the mismatch kernel problem with feature selection, and present efficient algorithms for it. Our experimental results show that, for string kernels with mismatches, kernel matrices can be computed 100-200 times faster. Evaluations of kernel vectors on new sequences using new techniques also require far less time than the standard approaches. In experiments with several DNA barcode datasets, k-mer string kernels also considerably improve identification accuracy compared to prior results. String kernels with feature selection demonstrate enhanced or similar classification performance with substantially fewer computations than full feature kernels.
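As a rough illustration of the sorting idea for exact k-mer kernels mentioned in the abstract, the sketch below builds a kernel matrix by listing every k-mer occurrence, sorting the list so that equal k-mers become adjacent, and accumulating per-sequence count products for each run. The function name `kmer_kernel_matrix` and the list-of-lists output are assumptions made for illustration. This is not the authors' implementation, and it does not cover kernels with mismatches or feature selection.

```python
from collections import Counter
from itertools import groupby


def kmer_kernel_matrix(seqs, k):
    """Exact k-mer (spectrum) kernel matrix computed with one global sort.

    K[a][b] is the sum over all k-mers of
    (occurrences in seqs[a]) * (occurrences in seqs[b]).
    Hypothetical helper for illustration only.
    """
    # One (k-mer, sequence index) pair per k-mer occurrence.
    # Sequences shorter than k contribute no pairs.
    pairs = [
        (s[i:i + k], idx)
        for idx, s in enumerate(seqs)
        for i in range(len(s) - k + 1)
    ]
    pairs.sort()  # after sorting, identical k-mers are adjacent

    n = len(seqs)
    K = [[0] * n for _ in range(n)]
    # Each run of equal k-mers contributes count products to the matrix.
    for _, run in groupby(pairs, key=lambda p: p[0]):
        counts = Counter(idx for _, idx in run)
        for a, ca in counts.items():
            for b, cb in counts.items():
                K[a][b] += ca * cb
    return K


if __name__ == "__main__":
    for row in kmer_kernel_matrix(["ACGTAC", "CGTACG", "TTTT"], 3):
        print(row)
```

A matrix built this way could be passed to an SVM implementation that accepts precomputed kernels. Normalization and the choice of k are left out of the sketch.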
Similar resources
Comparison of single versus ensemble of Support Vector Machine classifiers using evolutionary information with Polynomial Kernel
For N-, O-, and C-linked glycosylation, we trained ensembles of Support Vector Machine (SVM) classifiers using evolutionary information to predict whether or not a site in a protein sequence is a glycosylation site. An ensemble of SVMs is simply a collection of SVM classifiers, each trained on a balanced subsample of the training data. The prediction of the ensemble is computed from the predict...
SUBCLASS FUZZY-SVM CLASSIFIER AS AN EFFICIENT METHOD TO ENHANCE THE MASS DETECTION IN MAMMOGRAMS
This paper is concerned with the development of a novel classifier for automatic mass detection of mammograms, based on contourlet feature extraction in conjunction with statistical and fuzzy classifiers. In this method, mammograms are segmented into regions of interest (ROI) in order to extract features including geometrical and contourlet coefficients. The extracted features benefit from...
A Fast Dual Method for HIK SVM Learning
Histograms are used in almost every aspect of computer vision, from visual descriptors to image representations. Histogram Intersection Kernel (HIK) and SVM classifiers are shown to be very effective in dealing with histograms. This paper presents three contributions concerning HIK SVM classification. First, instead of limited to integer histograms, we present a proof that H...
The Research on Email Classification Based on Q-gaussian Kernel Svm
The use of different kernel functions in SVM (Support Vector Machines) has been reported in the literature. In this paper, the use of the q-Gaussian function as a kernel function in SVM is investigated and the q-Gaussian function is explored. While the q-Gaussian kernel SVM classifiers are being built, cross validation methods are used to select the non-extensive entropic index q under varying feature si...
Analysis of Kernel Based Protein Classification Strategies Using Pairwise Sequence Alignment Measures
We evaluated methods of protein classification that use kernels built from BLAST output parameters. Protein sequences were represented as vectors of parameters (e.g. similarity scores) determined with respect to a reference set, and used in Support Vector Machines (SVM) as well as in simple nearest neighbor (1NN) classification. We found, using ROC analysis, that aggregate representations that ...
Publication date: 2007